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Abstract. The suffix tree is a very important data structure in string processing, but it 
suffers from a huge space consumption. In large-scale applications, compressed suffix trees 
(CSTs) are therefore used instead. A CST consists of three (compressed) components: the 
suffix array, the LCP-array, and data structures for simulating navigational operations on the 
suffix tree. The LCP-array stores the lengths of the longest common prefixes of lexicograph- 
ically adjacent suffixes, and it can be computed in linear time. In this paper, we present 
new LCP-array construction algorithms that are fast and very space efficient. In practice, our 
algorithms outperform the currently best algorithms. 

1 Introduction 

A suffix tree for a string S of length re is a compact trie storing all the suffixes of S. It is an 
extremely important data structure with applications in string matching, bioinformatics, 
and document retrieval, to mention only a few examples; see e.g. [2]. The drawback of 
suffix trees is their huge space consumption of about 20 times the text size, even in carefully 
engineered implementations. To reduce their size, several authors provided compressed suffix 
trees (CSTs); see e.g. |12| and [8] for a survey. A CST of S can be divided into three 
components: (1) the suffix array SA, specifying the lexicographic order of <S"s suffixes, (2) the 
LCP-array, storing the lengths of the longest common prefixes of lexicographically adjacent 
suffixes, and (3) additional data structures for simulating navigational operations on the 
suffix tree. 

Particular emphasis has been put on efficient construction algorithms for all three com- 
ponents of CSTs. (Here, "efficiency" encompasses both construction time and space, as the 
latter can cause a significant memory bottleneck.) This is especially true for the first com- 
ponent. In the last decade, much effort has gone into the developement of efficient suffix 
array construction algorithms (SAC As); see |10| for a survey. Although linear-time direct 
SACAs were known since 2003, experiments showed |10) that these were outperformed in 
practice by SACAs having a worst-case time compexity of 0(n 2 log n). To date, however, 
the fastest SACA is a linear time algorithm |9]. Interestingly, for ASCII alphabet its speed 
can compete with the fastest LCP-array construction algorithms (LAC A) which uses equal 
or less space. This is somewhat surprising because sorting all suffixes seems to be more 
difficult than computing lcp- values. 

As discussed in Section [2l today's best LACAs |3|5] are linear time algorithms, but they 
suffer from a poor locality behavior. In this paper, we present two very space efficient (using 



n or 2n bytes only) and fast LACAs. Based on the observation that one cache miss takes 
approximately the time of 20 character comparisons, we try to trade character comparisons 
for cache misses. The algorithms use the text (string) S, the suffix array, and the Burrows- 
Wheeler Transform (BWT). Since most CSAs are based on the BWT anyway, we basically 
get it for free. Our experiments show the significance of the algorithms. More precisely, 
experimental results show that our algorithms outperform state-of-the-art algorithms |3|5| . 
For large texts they are always faster than the previously best algorithms. The superiority 
of our new LACAs varies with the text size (the larger the better), the alphabet size (the 
smaller the better), the number of "large" values in the LCP-array (the less the better), 
and the runs in the BWT (the more the better). The algorithms work particularly well on 
two types of data that are of utmost importance in practice: long DNA sequences (small 
alphabet size) and large collections of XML documents (long runs in the BWT) . 

2 Related work 

In their seminal paper [6], Manber and Myers did not only introduce the suffix array but 
also the longest-common-prefix (LCP) array. They showed that both the suffix array and the 
LCP-array can be constructed in 0(n log n) time for a string of length n. Kasai et al. [5] gave 
the first linear time algorithm for the computation of the LCP-array. Their algorithm uses 
the string S, the suffix array, the inverse suffix array, and of course the LCP-array. Each of 
the arrays requires An bytes (under the assumption that n < 2 32 ), thus the algorithms needs 
13n bytes in total (for an ASCII alphabet). The main advantage of their algorithm is that it 
is simple and uses at most In character comparisons. But its poor locality behavior results 
in many cache misses, which is a severe disadvantage on current computer architectures. 
Manzini [7] reduced the space occupancy of Kasai et al.'s algorithm to 9n bytes with a 
slow down of about 5% — 10%. He also proposed an even more space-efficient (but slower) 
algorithm that overwrites the suffix array. Recently, Karkkainen et al. [3j proposed another 
variant of Kasai et al.'s algorithm, which computes a permuted LCP-array (PLCP-array). 
In the PLCP-array, the lcp-values are in text order (position order) rather than in suffix 
array order (lexicographic order). This algorithm takes only 5n bytes and is much faster 
than Kasai et al.'s algorithm because it has a much better locality behavior. However, in 
virtually all applications lcp-values are required to be in suffix array order, so that in a 
final step the PLCP-array must be converted into the LCP-array. Although this final step 
suffers (again) from a poor locality behavior, the overall algorithm is still faster than Kasai 
et al.'s. In a different approach, Puglisi and Turpin jTTj tried to avoid cache misses by using 
the difference cover method of Karkkainen and Sanders [JJ. The worst case time complexity 
of their algorithm is 0(nv) and the space requirement is n + 0(n/y/v + v) bytes, where v 
is the size of the difference cover. Experiments showed that the best run-time is achieved 
for v = 64, but the algorithm is still slower than Kasai et al.'s. This is because it uses 
constant time range minimum queries, which take considerable time in practice. To sum 
up, the currently best LACA is that of Karkkainen et al. [3]. 



3 Preliminaries 



Let 27 be an ordered alphabet whose smallest element is the so-called sentinel character $. 
If 27 consists of a characters and is fixed, then we may view 27 as an array of size a such 
that the characters appear in ascending order in the array E[0..a — 1], i.e., 27 [0] = $ < 
27[1] < . . . < U[a — 1]. In the following, S is a string of length n over 27 having the sentinel 
character at the end (and nowhere else). For < i < n — 1, S[i] denotes the character at 
position i in S. For i < j, S[i..j] denotes the substring of S starting with the character at 
position i and ending with the character at position j. Furthermore, Si denotes the suffix 
S[i..n — 1] of S. The suffix array SA of the string S is an array of integers in the range 
to n — 1 specifying the lexicographic ordering of the n suffixes of the string S, that is, it 
satisfies SsAfo] < <SsA[l] < • • • < SsAfn-llj see Fig. [TJ for an example. In the following, ISA 
denotes the inverse of the permutation SA. 

The LCP-array is an array containing the lengths of the longest common prefix between 
every pair of consecutive suffixes in SA. We use lcp(u, v) to denote the length of the longest 
common prefix between strings u and v. Thus, the lcp-array is an array of integers in the 
range to n such that LCP[0] = —1, LCP[?i] = —1, and LCP[i] = lcp(SsA[i-i]i <SsA[il) for 
1 < i < re — 1; see Fig. [TJ For i < j, a range minimum query RMQ(i, j) on the interval 
in the LCP-array returns an index k such that LCP[fc] = min{LCP[/] | i < I < j}. It is not 
difficult to show that lcp(S SA[i] , S SA[j] ) = LCP[RMQ(i + 1, j)]. 

The Burrows and Wheeler transform [1] converts a string S into the string BWT[0..ti— 1] 
defined by BWT[i] = 5[SA[i] - 1] for all i with SA[f] + and BWT[i] = $ otherwise; see Fig. 
[U The LF-mapping is defined by LF[i] = ISA[SA[i] - 1] for all % with SA[i] + and LF[i] = 
otherwise; see Fig. [TJ Its long name last-to-first column mapping stems from the fact that it 
maps the last column L = BWT to the first column F, where F contains the first character 
of the suffixes in the suffix array, i.e., F[i] = S[SA[i]\. More precisely, if BWT[i] = c is the 
A;-th occurrence of character c in BWT, then j = LF[i] is the index such that F[j] is the 
&-th occurrence of c in F. The LF-mapping can be implemented by LF[i] = C[c] + occ(c, i), 
where c = BWT[i], C[c] is the overall number (of occurrences) of characters in 5 which 
are strictly smaller than c, and occ(c, i) is the number of occurrences of the character c in 
BWT[l.i]. 

4 First algorithm 

In this section, we present our first LACA. A pseudo-code description can be found in 
Algorithm [TJ and an application of it is illustrated in Fig. [TJ Furthermore, Theorem [TJ does 
not only prove its correctness but also explains it. The algorithm is based on Lemma [TJ 
which in turn requires the following definition. 
Define a function prev by 

prev(i) = max{j | < j < i and BWT [7] = BWT[i]} 
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Fig. 1. Left: Suffix array, LCP-array, LF-mapping, and BWT of the string S = 
el_anele_lepanelen$. Right: The LCP-array after the j'th iteration of Algorithm Q] (omit- 
ted entries are not computed yet). 



where prev{i) = — 1 if the maximum is taken over an empty set. Intuitively if we start 
at index i and scan the BWT upward, then prev(i) is the first index at which the same 
character BWT[il occurs. 



Lemma 1. 

LCP[LF[i]] 



0, if prev(i) = —1 

1 + LCP[RMQ(prei)(i) + otherwise 



Proof. If prev(i) = — 1, then <Slfm ^ s the lexicographically smallest suffix among all suffix 
having BWT[i] as first character. Hence LCP[LF[i]] = 0. Otherwise, LF[preu(i)] = LF[i] — 1. 
In this case, it follows that 

LCP[LF[i]] = Zqp(£sA[LF[i]-l])SsA[LF[i]]) = lc P(^SA[LF[prev(i)]] , <SsA[LF[i]] ) 

= 1 + lcp(S S A]prev{i)]i SsA[i\) = 1 + LCP[RMQ(preu(i) + l,i)] 



Theorem 1. Algorithm^ correctly computes the LCP-array. 

Proof. Under the assumption that all entries in the LCP-array in the first i — 1 iterations 
of the for-loop have been computed correctly, we consider the i-th iteration and prove: 



Algorithm 1 Construction of the LCP-array. 



01 last_occ[0..o- - 1] «- [-1, -1, . . - , -1] 

02 LCP[0] «- -1; LCP[n] <- -1; LCP[LF[0]] <- 

03 for i 1 to n — 1 do 

04 if LCP[i] = _L then /* LCP[i] is undefined */ 

05 e<-o 

06 if LF[i] < i then 

07 ^ <- max{LCP[LF[i]] - 1, 0} 

08 if BWT[i] = BWT[i - 1] then 

09 continue at line 12 

10 while S[SA[i] + i\ = 5[SA[i - 1] + £\ do 
11 

12 LCP[i] <- £ 

13 if LF[i] > i then 

14 LCP[LF[z]] <- LCP[RMQ(last_occ[BWT[z]] + l,i)] + 1 

15 last_occ[BWT[i]] <- i 



1. If LCP[i] = _L, then the entry LCP[i] will be computed correctly. 

2. If LF[i] > i, then the entry LCP[LF[z]] will be computed correctly. 

(1) If the if-condition in line 6 is not true, then SsaU— l] an d 5*SA[j] are compared character 
by character (lines 10-11) and LCP[i] is assigned the correct value in line 12. Otherwise, 
if the if-condition in line 6 is true, then £ is set to max{LCP[LF[i]] — 1,0}. We claim that 
I < LCP[«]. This is certainly true if I = 0, so suppose that i = LCP[LF[z]] - 1 > 0. 
According to (the proof of) Lemma [TJ LCP[LF[i]] — 1 = lcp(Ssk\prev(i)]i ^SMi])' Obviously, 
lcp(S S A\prev(i)], S S A[{\) < lcp(S S A[i-l] , SsA[i\), so tlle claim follows. 

Now, if BWT[i] ^ BWT[i — 1], then SsA[j-i] an( i ^SA[i] are compared character by 
character (lines 10-11), but the first £ characters are skipped because they are identi- 
cal. Again, LCP[i] is assigned the correct value in line 12. Finally, if BWT[i] = BWT[i — 
1], then prev(i) = i — 1. This, in conjunction with Lemma [TJ yields LCP[LF[i]] — 1 = 
lcp(SsA{ P rev(i)}, SsA{i\) = lcp(S S A[i-i] , S S a[{\ ) = LCP[i]. Thus, £ = LCP[LF[i]] - 1 is already 
the correct value of LCP[z]. So lines 10-11 can be skipped and the assignment in line 12 is 
correct. 

(2) In the linear scan of the LCP-array, we always have last_occ[BWT[i]] = prev(i). 
Therefore, it is a direct consequence of Lemma Q] that the assignment in line 14 is correct. 

We still have to explain how the index j = RMQ(last_occ[BWT[z]] + l,i)) and the lcp- 
value LCP[j] in line 14 can be computed efficiently. To this end, we use a stack K of size 
0(a). Each element on the stack is a pair consisting of an index and an lcp- value. We first 
push (0, —1) onto the initially empty stack K. It is an invariant of the for-loop that the 
stack elements are strictly increasing in both components (from bottom to top). In the ith 
iteration of the for-loop, before line 13, we update the stack K by removing all elements 
whose lcp-value is greater than or equal to LCP[i]. Then, we push the pair (i, LCP[i]) onto 



K. Clearly, this maintains the invariant. Let x = last_occ[BWT[f]] + 1. The answer to 
RMQ(x,i) is the pair (j,£) where j is the minimum of all indices that are greater than or 
equal to x. This pair can be found by an inspection of the stack. Moreover, the lcp-value 
LCP[x] + 1 we are looking for is £ + 1. To meet the 0(c) space condition of the stack, we 
check after each crth update if the size s of K is greater than a. If so, we can remove s — a 
elements from K because there are at most a possible queries. With this strategy, the stack 
size never exceeds 2a and the amortized time for the updates is G(n). Furthermore, an 
inspection of the stack takes 0(a) time. In practice, this works particularly well when there 
is a run in the BWT because then the element we are searching for is on top of the stack. 

Algorithm Q] has a quadratic run time in the worst case, consider e.g. the string S = 
ababab...ab%. 

At first glance, Algorithm Q] does not have any advantage over Kasai et al.'s algorithm 
because it holds S, SA, LF, BWT, and LCP in main memory. A closer look, however, reveals 
that the arrays SA, LF, and BWT are accessed sequentially in the for-loop. So they can be 
streamed from disk. We cannot avoid the random access to S, but that to LCP as we shall 
show next. 

Most problematic are the "jumps" upwards (line 7 when LF[z] < i) and downwards (line 
14 when LF[i] > i). The key idea is to buffer lcp- values in queues (FIFO data structures) 
and to retrieve them when needed. 

First, one can show that the condition LCP[i] = _L in line 4 is equivalent to i > C[F[i\] + 
occ(B\NT[i],i). The last condition can be evaluated in constant time and space (increment 
a counter cnt(B\NT[i]) in iteration i), so it can replace LCP[i] = _L in line 4. This is 
important because in case j = LF[i] > i, the value LCP [7] is still in one of the queues 
and has not yet been written to the LCP-array. In other words, when we reach index j, 
we still have LCP[j] = _L although LCP [7] has already been computed. Thus, by the test 
i > C[F[j]] + occ(BWT[j], 1) we can decide whether LCP [7] has already been computed or 
not. 

Second, LF[i] lies in between C[BWT[i]] and C[BWT[i]]+occ(BWT[i],n-l), the interval 
of all suffixes that start with character BWT[z]. Note that there are at most a different such 
intervals. We exploit this fact in the following way. For each character c € S we use a queue 
Q c . During the for-loop we add (enqueue) the values LCP[C[c]], LCP[C[c]+l], . . . , LCP[C[c] + 
occ(c, n — 1)] in exactly this order to Q c . In iteration i, an operation enqueue(Q c , x) is done 
for c = BWT[i] and x = LCP[RMQ(last_occ[BWT[z]] + + 1 in line 14 provided that 
LF[i] > i, and in line 12 for c = F[i] and x = t. Also in iteration i, an operation dequeue(Q c ) 
is done for c = BWT[i] in line 7 provided that LF[i] < %, This dequeue operation returns the 
value LCP[LF[i]] which is needed in line 7. Moreover, if i < C[F[i]] + occ(BWT[i], i), then 
we know that LCP[i] has been computed previously but is still in one of the queues. Thus, 
an operation dequeue(Q c ) is done for c = F[i] immediately before line 13, and it returns 
the value LCP[i]. 

The space used by the algorithm now only depends on the size of the queues. We use 
constant size buffers for the queues and read/write the elements to/from disk if the bufferes 



are full/empty (this even allows to answer an RMQ by binary search in 0(log(<r)) time). 
Therefore, only the text S remains in main memory and we obtain an n bytes semi-external 
algorithm. 

5 Improved algorithm 

Our experiments showed that even a careful engineered version of Algorithm Q] does not 
always beat the currently fastest LACA [3]. For this reason, we will now present another 
algorithm that uses a modification of Algorithm [1] in its first phase. This modified version 
computes each LCP-entry whose value is smaller than or equal to m, where m is a user- 
defined value. (All we know about the other entries is that they are greater than m.) It can 
be obtained from Algorithm Q] by modifying lines 8, 10, and 14 as follows: 

08 if BWT[i] = BWT[f - 1] and £ < m then 

10 while 5[SA[i] + £] = 5[SA[i - 1] + £] and Km + ldo 

14 LCP[LF[i]] <- min{LCP[RMQ(last_occ[BWT[i]] + l,t)] + l,m + 1} 

In practice, m = 254 is a good choice because LCP- values greater than m can be marked 
by the value 255 and each LCP-entry occupies only one byte. Because the string S must 
also be kept in main memory, this results in a total space consumption of 2n bytes. 

Let / =[i|0<i<n and LCP[«] > m] be an array containing the indices at which the 
values in the LCP-array are > m after phase 1. In the second phase we have to calculate the 
remaining ni = \I\ many LCP-entries, and we use Algorithm [2] for this task. In essence, this 
algorithm is a combination of two algorithms presented in |3] that compute the PLCP-array: 
(a) the linear time ^-algorithm and (b) the 0(n log n) time algorithm based on the concept 
of irreducible lcp- values. Let us recapitulate necessary definitions from [3]- 

Definition 1. For all i with 1 < i < n — 1 let <F[SIK[i]] = Sk[i — 1], and for all j with 
< j < n — 1 let PLCP[j] = lcp(Sj, S$y]). An entry PLCP[j] , where j > 0, is called 
reducible if S[j — 1] = 5'[<P[j] — 1]; otherwise it is irreducible. 

Note that PLCP[SA[i]] is reducible if and only if BWT[i] = BWT[i - 1]. This is because 
BWT[t] = 5[SA[t] - 1] and BWT[i - 1] = 5[SA[t - 1] - 1] = 5[<P[SA[i]] - 1]. 

Lemma 2. For all j with < j < n — 1, we have PLCP[j] > PLCP[j — 1] — 1. Moreover, 
if PLCP[j] is reducible, then PLCP[j] = PLCP[j - 1] - 1. 

Proof. See [5l7f3] . 

The preceding lemma has the following two consequences: 

— If we compute an entry PLCP[j] (where j varies from 1 to n — 1), then PLCP[j — 1] 
many character comparisons can be skipped. This is the reason for the linear run time 
of Algorithm El cf. [SE]. 



— If we know that PLCP[j] is reducible, then no further character comparison is needed 
to determine its value. At first glance this seems to be unimportant because the next 
character comparison will yield a mismatch anyway. At second glance, however, it turns 
out to be important because the character comparison may result in a chache miss! 
(Note that in contrast to the 0{n log n) time algorithm in [3], the ^-algorithm does not 
make use of this property.) 



Algorithm 2 Phase 2 of the construction of the LCP-array. (In practice SA[n— 1] can be 
used for the undefined value _L because the entries in the <£-array are of the form SA[z — 1], 
i.e., SA[n — 1] does not occur in the <?-array.) 

01 b[0..n- 1] <- [0,0, ...,0] 

02 for i <— to n — 1 do 

03 if LCP[i] > m then 

04 fc[SA[i]] <— 1 /* the 6-array can be computed in phase 1 already */ 

05 <P[0..rai - 1] <- LL,_L,...,±] 

06 initialize a rank data structure for b 

07 for i «- to n - 1 do /* stream SA, LCP, and BWT from disk */ 

08 if LCP[i] > m and BWT[i] / BWT[i-l] then /* PLCP[SA[i]] is irreducible */ 

09 $[ranki(SA[i])] <- SA[i - 1] 

10 ji <- 

11 £<-m+l 

12 PLCP[0..n/ -1] «- [0,0,..., 0] 

13 for j to n — 1 do /* left-to-right scan of 6 and S, but random access to S */ 

14 if b[j] = 1 then 

15 if j j= and b\j-l] = 1 then 

16 t <— I — 1 /* at least £ — 1 characters match by Lemma [2] */ 

17 else 

18 I A— m + 1 /* at least m + 1 characters match by phase 1 */ 

19 if / i- then /* PLCP[j/] is irreducible */ 

20 while S[j +£] = S[$\ji] + C[ do 

21 e*-e + 1 

22 PLCP[jj] -s— £ /* if PLCP[j/] is reducible, no character comparison was needed */ 

23 j! «- ji + 1 

24 for i ^— to 7i — 1 do /* stream SA and LCP from disk */ 

25 if LCP [i] > m then 

26 LCP[i] <- PLCP[mnfci(SA[i])] 



Algorithm [2] uses a bit array 6, where 6[SA[i]] = if LCP[i] is known already (i.e., 
b[j] = if PLCP[j] is known) and 6[SA[i]] = 1 if LCP[i] still must be computed (i.e., b[j] = 1 
if PLCP[j] is unknown); see lines 1-4 of the algorithm. In contrast to the ^-algorithm [3], 



our algorithm does not compute the whole #- array (PLCP-array, respectively) but only the 
ni many entries for which the lcp- value is still unknown (line 5). So if we would delete the 
values <P\j] (PLCP[j], respectively) for which b[j] = from the original <£-array (PLCP-array, 
respectively) [3], we would obtain our array ^[0..n/ — 1] (PLCP[0..n/ — 1], respectively). We 
achieve a direct computation of ^[0..n/ — 1] with the help of a rank data structure for 
the bit array b such that rank queries rank\(j) can be answered in constant time, where 
ranki(j) returns the number of l's up to position j in b. The for-loop in lines 7-9 fills our 
array <P[0..nj — 1] but again there is a difference to the original <£-array: reducible values 
are omitted! After initialization of the counter jj, the number £ (of characters that can be 
skippped), and the PLCP array, the for-loop in lines 13-23 fills the array PLCP[0..n/ — 1] 
by scanning the 6-array and the string S from left to right. In line 14, the algorithm tests 
whether the lcp- value is still unknown (this is the case if b[j] = 1). If so, it determines 
the number of characters that can be skippped in lines 15-18. If PLCP[j/] is irreducible 
(equivalently, <&[ji] ^ _L) then its correct values is computed by character comparisons in 
lines 20-21. Otherwise, PLCP[j/] is reducible and PLCP[j/] = PLCP[j/ - 1] - 1 by LemmaEl 
In both cases PLCP[jj] is assigned the correct value in line 22. Finally, the missing entries 
in the LCP-array (lcp-values in suffix array order) are filled with the help of PLCP-array 
(lcp- values in text order) in lines 24-26. 

Clearly, the first phase of our algorithm has a linear worst-case time complexity. The 
same is true of the second phase as explained above. Thus, the whole algorithm has a linear 
run-time. 
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